docs: add fault injection test plan #139
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 386686f2d1
    local baseline_nodes baseline_tenants test_pid rc current_time health_checks
    preflight "$scenario"
    mkdir -p "$artifacts"
    baseline_nodes="$(kubectl_cluster get nodes -o json | jq -r '.items | length')"
Count ready nodes consistently in the health baseline
When the real test cluster has any pre-existing NotReady node, preflight can still pass because it only requires at least four schedulable Ready nodes, but this baseline records all nodes and health_is_safe compares it to the current Ready-node count. That makes fault-run fail the health guard immediately on otherwise usable dedicated clusters; capture the same Ready-node predicate here or reject NotReady nodes during preflight.
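One way to read this suggestion is that preflight, the baseline capture, and the health guard should all go through a single node predicate. The harness does this in shell with `jq`; the Rust sketch below only models the logic. `NodeStatus`, `is_usable`, `count_usable`, and `ready_count_is_safe` are hypothetical names, and the `>=` comparison is an assumption about how the guard compares counts.

```rust
/// Hypothetical stand-in for the Node fields that matter here; the real
/// harness reads them from `kubectl get nodes -o json`.
#[derive(Debug, Clone)]
pub struct NodeStatus {
    pub name: String,
    /// True when the node's `Ready` condition has status "True".
    pub ready: bool,
    /// True when `.spec.unschedulable` is set (the node is cordoned).
    pub unschedulable: bool,
}

/// The single predicate shared by preflight, baseline capture, and the guard.
pub fn is_usable(node: &NodeStatus) -> bool {
    node.ready && !node.unschedulable
}

/// Counts nodes that satisfy the shared predicate.
pub fn count_usable(nodes: &[NodeStatus]) -> usize {
    nodes.iter().filter(|n| is_usable(n)).count()
}

/// Assumed guard semantics: safe while no usable node has been lost
/// relative to the baseline.
pub fn ready_count_is_safe(baseline_usable: usize, current: &[NodeStatus]) -> bool {
    count_usable(current) >= baseline_usable
}
```

With a baseline taken through `count_usable`, a pre-existing NotReady node no longer trips the guard; a baseline taken as the total node count would fail it on the first check.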
    pv_count="$(kubectl_cluster get pv -o json | jq -r --arg storage_class "$storage_class" '
      [.items[]
        | select(.spec.storageClassName == $storage_class)
        | select(.status.phase == "Available" or .status.phase == "Bound")
Reject PVs already bound outside the fault tenant
For dm-flakey, this count treats every Bound 100Gi PV in the selected no-provisioner StorageClass as usable. If one of the four PVs is already bound to another namespace or application, preflight still passes even though the fault Tenant cannot claim it (or the run is pointing at non-dedicated storage), so the scenario later hangs/fails after mutating the fault namespace. Only count Available PVs plus Bound PVs whose claimRef belongs to the owned fault namespace/tenant.
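The tightened filter could look roughly like the following. This is a sketch of the predicate only, written in Rust rather than the `jq` the script uses; `PvSummary`, `PvPhase`, `is_claimable_pv`, and `count_claimable_pvs` are hypothetical names. It matches on the claimRef namespace alone and ignores the 100Gi capacity check, both of which are simplifying assumptions.

```rust
/// Subset of PersistentVolume phases relevant to the check.
#[derive(Debug, Clone, PartialEq)]
pub enum PvPhase {
    Available,
    Bound,
    /// Any other phase (for example Released or Failed).
    Other,
}

/// Hypothetical summary of one PV, as it might be extracted from
/// `kubectl get pv -o json`.
#[derive(Debug, Clone)]
pub struct PvSummary {
    pub storage_class: String,
    pub phase: PvPhase,
    /// Value of `.spec.claimRef.namespace`, if the PV has a claimRef.
    pub claim_namespace: Option<String>,
}

/// A PV counts only if it is Available, or Bound to a claim in the
/// fault namespace that the test owns.
pub fn is_claimable_pv(pv: &PvSummary, storage_class: &str, fault_namespace: &str) -> bool {
    if pv.storage_class != storage_class {
        return false;
    }
    match pv.phase {
        PvPhase::Available => true,
        PvPhase::Bound => pv.claim_namespace.as_deref() == Some(fault_namespace),
        PvPhase::Other => false,
    }
}

/// Counts the PVs the fault Tenant could actually use.
pub fn count_claimable_pvs(pvs: &[PvSummary], storage_class: &str, fault_namespace: &str) -> usize {
    pvs.iter()
        .filter(|pv| is_claimable_pv(pv, storage_class, fault_namespace))
        .count()
}
```

Under this predicate, a PV bound to another namespace lowers the count, so preflight would fail before the fault namespace is mutated.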
    pub fn from_env() -> Result<Self> {
        let context = current_context()?;
        Self::from_env_with(|name| std::env::var(name).ok(), context)
Enforce the expected context inside the Rust fault harness
If someone runs the ignored faults test binary or cargo test directly with RUSTFS_FAULT_TEST_DESTRUCTIVE=1, this path accepts whatever kubeconfig context is current and only rejects kind-*; the documented RUSTFS_FAULT_TEST_EXPECTED_CONTEXT guard exists only in the shell wrapper. That leaves destructive namespace/PVC/Chaos cleanup one stale kubectl config use-context away from the wrong real cluster, so the Rust config should also require and compare the expected context before constructing ClusterTestConfig.
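A minimal sketch of such a guard, assuming it runs before `ClusterTestConfig` is constructed and reuses the same lookup-closure shape that `from_env_with` takes. `ContextError` and `check_expected_context` are hypothetical names; only the `RUSTFS_FAULT_TEST_EXPECTED_CONTEXT` variable and the `kind-*` rejection come from the PR.

```rust
/// Hypothetical error type for the context guard.
#[derive(Debug, PartialEq)]
pub enum ContextError {
    /// RUSTFS_FAULT_TEST_EXPECTED_CONTEXT is unset or blank.
    MissingExpectedContext,
    /// The current context is a kind cluster, which the harness already rejects.
    KindContext(String),
    /// The current kubeconfig context is not the one the operator asked for.
    Mismatch { expected: String, actual: String },
}

/// Requires an explicit expected context and compares it with the current one.
/// `lookup` reads environment variables, so tests can pass a stub.
pub fn check_expected_context<F>(lookup: F, current: &str) -> Result<(), ContextError>
where
    F: Fn(&str) -> Option<String>,
{
    let expected = lookup("RUSTFS_FAULT_TEST_EXPECTED_CONTEXT")
        .filter(|value| !value.trim().is_empty())
        .ok_or(ContextError::MissingExpectedContext)?;

    if current.starts_with("kind-") {
        return Err(ContextError::KindContext(current.to_string()));
    }
    if expected != current {
        return Err(ContextError::Mismatch {
            expected,
            actual: current.to_string(),
        });
    }
    Ok(())
}
```

Failing closed when the variable is missing means a direct `cargo test` run with `RUSTFS_FAULT_TEST_DESTRUCTIVE=1` cannot proceed on whatever context happens to be current.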
Type of Change
Related Issues
N/A
Summary of Changes
Adds a Chinese-language fault injection test plan for the RustFS Operator e2e harness.
The plan documents:
Checklist
- make pre-commit (fmt-check + clippy + test + console-lint + console-fmt-check)
- [Unreleased] (if user-visible change)
Impact
Verification
Additional Notes
No runtime code changes are included. This PR only adds the detailed e2e fault injection design document.